This coursework focuses on applying a multivariate linear regression model to predict the total cost for two travelers staying for four nights in an Airbnb in Vienna, Austria. The data set is quite large and raw (non-numerical values, variables with several NA values etc.), so this is as much a data management project as it is a modelling assignment. In the exploratory analysis section, we first start by analysing the data to identify potential deterministic relationships. We also make sure to tidy up the data by transforming non-numerical variables and by tailoring certain variables to optimise a final model. This analysis is aided by some visualisations, which aid our interpretation of the data and variable selection. The model fitting consists of applying a forward stepwise regression i.e. we test if the addition of a variable improves our model’s explanatory ability iteratively whilst controlling for/partialling out the effect of other variables. We apply a significance level of 5% throughout the analysis. We perform a log-linear regression which means that we transform the target variable (price for two people to stay in Vienna for four nights) to its natural log. This means we can interpret the coefficients on the explanatory variables as approximate expected percentage changes in price for four nights. This methodology has been well-researched, so our contribution is strictly empirical.
Our optimal model reads as follows: \(Y = 3.64 - 0.44(Private\ Room) - 0.57(Shared\ Room) - 0.17(T5) - 0.13(T4) + 0.05(T2) + 0.38(Innere\ Stadt) + 0.29(Rating) + 0.09(Accommodates) + 0.14(Bathrooms) + 0.06(Superhost) + 0.01(Availability\_30) + \epsilon_i\) - where Y = ln(price_4_nights).
This analysis is very approachable for the average person given the obvious logic behind each of the variables. We can interpret the following factors which influence the average expected price to stay in Vienna for four nights as follows:
We first load the data and parse the price variable into an integer.
# Price is a character, parse to integer
vienna_listings <- vienna_listings %>%
mutate(price = parse_number(price))
# In case of duplicate rows
unique_listings = unique(vienna_listings[,])The following code plots on the map all AirBnBs where minimum_nights is less than equal to four.
leaflet(data = filter(vienna_listings, minimum_nights <= 4)) %>%
addProviderTiles("OpenStreetMap.Mapnik") %>%
addCircleMarkers(lng = ~longitude,
lat = ~latitude,
radius = 1,
fillColor = "blue",
fillOpacity = 0.4,
popup = ~listing_url,
label = ~property_type)Using this map, we can be redirected to any airbnb we want to check out along with an overview of their location. We know move on to exploratory analysis.
We normally conduct a thorough EDA via three methods:
However, we analyse the correlation matrices later in the process when we have narrowed down further our “meaningful” variables, i.e. those with a suspected possible significant explanatory ability of the variation of price.
# Use Glimpse from the Dplyr package to get an initial look at the raw data
kable(dplyr::glimpse(vienna_listings))Our data has 11583 rows/observations with 74 variables.
# Generate summary statistics with mosaic and skimr packages
kable(skimr::skim(vienna_listings))There are 37 independent numeric variables, 9 logical, 5 date and 22 character variables.
Since there are 73 independent variables and many of them don’t make any sense (eg. url) for price prediction, we select some variables for simplifying later exploration. According to our hypothesis, the following variables could potentially affect the price.
# Logical screening
vienna_listings_cleaned <- vienna_listings %>%
select(
price,
host_response_time,
host_response_rate,
host_acceptance_rate,
host_is_superhost,
host_total_listings_count,
host_identity_verified,
neighbourhood_cleansed, #23 areas
latitude,
longitude,
property_type,
room_type,
bathrooms_text,
bedrooms,
accommodates,
beds,
minimum_nights,
maximum_nights,
availability_30,
availability_365,
instant_bookable,
number_of_reviews_ltm,
number_of_reviews,
review_scores_rating,
reviews_per_month
)#glimpse and skim again
kable(glimpse(vienna_listings_cleaned))
skimr::skim(vienna_listings_cleaned) # kable is too slow with skimThen, we plot the distribution of price. As the distribution is significantly right skewed, we use the log function to aptly transform the price data.
# Visualise the Price
vienna_listings_cleaned %>% ggplot(aes(x=price)) +
geom_density()+labs(
title = "Price PDF")+
theme_economist()# Our data is very positively skewed which will make prediction more difficult, so we take the natural log of price to normalise the data
vienna_listings_cleaned <- vienna_listings_cleaned %>%
mutate(lnprice = log(price))
# Visualise the Data
vienna_listings_cleaned %>% ggplot(aes(x=lnprice)) +
geom_density()+labs(
title = "Log Price PDF")+
theme_economist()confint(vienna_listings_cleaned$price, mean, level = 0.9)| 5% | 95% |
|---|---|
| 20 | 169 |
As the log price is still a bit right skewed, we then plot the 5% and 95% percentile of the price. This gives us approximate price upper and lower bounds of €20 and €169 respectively.
# Remove positive outliers and non-sensical/irrelevant values left of the mean to build a more meaningful analsysis
vienna_listings_cleaned <- vienna_listings_cleaned %>%
filter(between(lnprice, log(20), log(169)), na.RM=TRUE)
# Visualise the Data
vienna_listings_cleaned %>% ggplot(aes(x=lnprice)) +
geom_density()+labs(
title = "Log Price PDF (Tails Removed)")+
theme_economist()# save this dataset
vienna_listings_final <- vienna_listings_cleanedNow the natural log of price is approximatley normally distributed within these bounds; thus, we use lnprice as our dependent variable to construct our model.
We then change the non-numerical variables such as bathrooms, host_response_rate, and host_acceptance_rate to numerical variables.
vienna_listings_final <- vienna_listings_final%>%
mutate(bathrooms = parse_number(bathrooms_text),
host_response_rate = parse_number(host_response_rate),
host_acceptance_rate = parse_number(host_acceptance_rate))%>%
select(-bathrooms_text)#glimpse and skim again
kable(glimpse(vienna_listings_final))
kable(skim(vienna_listings_final))# Create factor variables for host response time
kable(unique(vienna_listings_final$host_response_time)) %>%
kable_styling()| x |
|---|
| within an hour |
| N/A |
| within a few hours |
| within a day |
| a few days or more |
| NA |
invisible(vienna_listings_final %>%
mutate(
# mark "N/A" as missing values
host_response_time = ifelse(host_response_time =="N/A",NA,host_response_time),
# relevel
host_response_time = fct_relevel(host_response_time,
"within an hour",
"within a few hours",
"within a day",
"a few days or more"
)))
# Create factor variables for room types
kable(unique(vienna_listings_final$room_type))%>%
kable_styling()| x |
|---|
| Hotel room |
| Entire home/apt |
| Private room |
| Shared room |
invisible(vienna_listings_final$room_type <- factor(vienna_listings_final$room_type,
labels = unique(vienna_listings_final$room_type)))In the original raw data, there are 74 Variables with 11,583 Observations. After filtering the data based on log price (to remove NA price values and outliers), we have 10,602 observations.
After variable selection based on logic, we have 26 variables.
The following variables are numbers: Numeric Variables * 19: host_total_listings_count, host_response_rate, host_acceptance_rate, accommodates, longitude, latitude, beds, bedrooms,bathrooms, variables for maximum and minimum nights, number of reviews, availabiity data, review scores data, calculated host data etc.
The following are categorical or factor variables: Logical/Boolean Variables * 3 : host_is_superhost, host_identity_verified, instant_bookable Character variables * 4 : host_response_time, neighbourhood_cleansed, property_type, room_type
Next, we look at the variable property_type. We use the count function to determine how many categories there are their frequency. The top 4 most common property types are “Entire rental unit”, “Private room in rental unit”, " Entire condominium (condo)“, and”Entire serviced apartment". The four types account for 91.19% proportion of the total listings.
# Identify the amount of each property type
property_type_count <- vienna_listings_final %>%
count(property_type) %>%
mutate(percentage = n/sum(n)*100)%>%
arrange(desc(n))
kable(property_type_count)| property_type | n | percentage |
|---|---|---|
| Entire rental unit | 6975 | 65.7894737 |
| Private room in rental unit | 1915 | 18.0626297 |
| Entire condominium (condo) | 485 | 4.5746086 |
| Entire serviced apartment | 293 | 2.7636295 |
| Entire loft | 154 | 1.4525561 |
| Private room in residential home | 103 | 0.9715148 |
| Private room in condominium (condo) | 98 | 0.9243539 |
| Entire residential home | 83 | 0.7828712 |
| Room in hotel | 66 | 0.6225241 |
| Room in boutique hotel | 51 | 0.4810413 |
| Private room in bed and breakfast | 48 | 0.4527448 |
| Private room in hostel | 45 | 0.4244482 |
| Shared room in rental unit | 41 | 0.3867195 |
| Room in serviced apartment | 29 | 0.2735333 |
| Room in aparthotel | 24 | 0.2263724 |
| Private room in loft | 21 | 0.1980758 |
| Entire guest suite | 19 | 0.1792115 |
| Private room in townhouse | 15 | 0.1414827 |
| Private room in guesthouse | 13 | 0.1226184 |
| Private room in serviced apartment | 11 | 0.1037540 |
| Entire townhouse | 10 | 0.0943218 |
| Shared room in hostel | 10 | 0.0943218 |
| Entire place | 9 | 0.0848896 |
| Entire bungalow | 8 | 0.0754575 |
| Private room in villa | 7 | 0.0660253 |
| Entire villa | 6 | 0.0565931 |
| Entire guesthouse | 5 | 0.0471609 |
| Private room in bungalow | 5 | 0.0471609 |
| Room in bed and breakfast | 5 | 0.0471609 |
| Shared room in bed and breakfast | 5 | 0.0471609 |
| Camper/RV | 4 | 0.0377287 |
| Shared room in condominium (condo) | 4 | 0.0377287 |
| Entire cottage | 3 | 0.0282965 |
| Private room | 3 | 0.0282965 |
| Private room in casa particular | 3 | 0.0282965 |
| Casa particular | 2 | 0.0188644 |
| Private room in castle | 2 | 0.0188644 |
| Private room in guest suite | 2 | 0.0188644 |
| Shared room in residential home | 2 | 0.0188644 |
| Shared room in serviced apartment | 2 | 0.0188644 |
| Tiny house | 2 | 0.0188644 |
| Dome house | 1 | 0.0094322 |
| Entire cabin | 1 | 0.0094322 |
| Entire chalet | 1 | 0.0094322 |
| Entire hostel | 1 | 0.0094322 |
| Lighthouse | 1 | 0.0094322 |
| Private room in barn | 1 | 0.0094322 |
| Private room in bus | 1 | 0.0094322 |
| Private room in camper/rv | 1 | 0.0094322 |
| Private room in cave | 1 | 0.0094322 |
| Private room in earth house | 1 | 0.0094322 |
| Private room in farm stay | 1 | 0.0094322 |
| Private room in nature lodge | 1 | 0.0094322 |
| Private room in treehouse | 1 | 0.0094322 |
| Shared room in loft | 1 | 0.0094322 |
Top 4:
Since the vast majority of the observations in the data are one of the top two property types (84%), we would like to combine Entire condominium (condo), Entire serviced apartment and Entire loft as a type called Entire apartment. This method was better suited to the Vienna dataset, we require sufficient datapoints for each category in order to reduce our model’s standard errors.
We then create a simplified version of property_type variable that has 4 categories: the top two categories, Entire apartment and Other.
# First we need to summarize the other values in the Category "Others"
vienna_listings_final <- vienna_listings_final %>%
mutate(prop_type_simplified = case_when(
property_type %in% c("Entire rental unit","Private room in rental unit") ~ property_type,
property_type %in% c( "Entire condominium (condo)", "Entire serviced apartment","Entire loft") ~ "Entire apartment",
TRUE ~ "Other"
))# Check if done correctly
kable(vienna_listings_final %>%
count(property_type, prop_type_simplified) %>%
arrange(desc(n)))| property_type | prop_type_simplified | n |
|---|---|---|
| Entire rental unit | Entire rental unit | 6975 |
| Private room in rental unit | Private room in rental unit | 1915 |
| Entire condominium (condo) | Entire apartment | 485 |
| Entire serviced apartment | Entire apartment | 293 |
| Entire loft | Entire apartment | 154 |
| Private room in residential home | Other | 103 |
| Private room in condominium (condo) | Other | 98 |
| Entire residential home | Other | 83 |
| Room in hotel | Other | 66 |
| Room in boutique hotel | Other | 51 |
| Private room in bed and breakfast | Other | 48 |
| Private room in hostel | Other | 45 |
| Shared room in rental unit | Other | 41 |
| Room in serviced apartment | Other | 29 |
| Room in aparthotel | Other | 24 |
| Private room in loft | Other | 21 |
| Entire guest suite | Other | 19 |
| Private room in townhouse | Other | 15 |
| Private room in guesthouse | Other | 13 |
| Private room in serviced apartment | Other | 11 |
| Entire townhouse | Other | 10 |
| Shared room in hostel | Other | 10 |
| Entire place | Other | 9 |
| Entire bungalow | Other | 8 |
| Private room in villa | Other | 7 |
| Entire villa | Other | 6 |
| Entire guesthouse | Other | 5 |
| Private room in bungalow | Other | 5 |
| Room in bed and breakfast | Other | 5 |
| Shared room in bed and breakfast | Other | 5 |
| Camper/RV | Other | 4 |
| Shared room in condominium (condo) | Other | 4 |
| Entire cottage | Other | 3 |
| Private room | Other | 3 |
| Private room in casa particular | Other | 3 |
| Casa particular | Other | 2 |
| Private room in castle | Other | 2 |
| Private room in guest suite | Other | 2 |
| Shared room in residential home | Other | 2 |
| Shared room in serviced apartment | Other | 2 |
| Tiny house | Other | 2 |
| Dome house | Other | 1 |
| Entire cabin | Other | 1 |
| Entire chalet | Other | 1 |
| Entire hostel | Other | 1 |
| Lighthouse | Other | 1 |
| Private room in barn | Other | 1 |
| Private room in bus | Other | 1 |
| Private room in camper/rv | Other | 1 |
| Private room in cave | Other | 1 |
| Private room in earth house | Other | 1 |
| Private room in farm stay | Other | 1 |
| Private room in nature lodge | Other | 1 |
| Private room in treehouse | Other | 1 |
| Shared room in loft | Other | 1 |
Next we can make a factor out of prop_type_simplified.
vienna_listings_final <- vienna_listings_final %>%
mutate(
prop_type_simplified = fct_relevel(prop_type_simplified,
"Entire rental unit",
"Private room in rental unit",
"Entire apartment",
"Other")) %>%
select(-property_type)We only want to include listings in our regression analysis that are intended for travel purposes, as this is the target user for our model.
# right skewed
vienna_listings_final %>%
ggplot(aes(x=minimum_nights))+
geom_histogram() +
theme_economist() +
labs(title="Minimum Nights Distribution")vienna_listings_final %>%
filter(minimum_nights<=10) %>%
mutate(minimum_nights=as.integer(minimum_nights)) %>%
ggplot(aes(x=minimum_nights))+
geom_bar()+
scale_x_continuous(breaks = seq(0,10,1)) +
theme_economist() +
labs(title="Minimum Nights Distribution (Outliers Removed)")Filter the airbnb data so that it only includes observations with minimum_nights <= 4
vienna_listings_final <- vienna_listings_final %>%
filter(minimum_nights<=4) %>%
filter(maximum_nights>=4) # Since we will be modelling the price to stay for four nights, maximum stay must be greater than or equal to 4
invisible(vienna_listings_final)We now have only 8,844 listings.
We want to leverage the longitude and latitude variables by calculating the distance between every listing and center of Vienna. We take postcode 1010 as the center of Vienna, which we confirmed with a local citizen.
# Calculate the distance, meters as unit
vienna_listings_final <- vienna_listings_final %>%
rowwise() %>%
mutate(
dist_from_cent = distm(c(latitude, longitude), c(48.208604427334215, 16.37353507157435),
fun = distVincentyEllipsoid)[1,1]
) We then compare median distance of each neighbourhood. Use median to avoid outliers.
dist_neighbourhood <- vienna_listings_final %>%
group_by(neighbourhood_cleansed) %>%
summarise(
med_dist = median(dist_from_cent),
med_price = median(price),
count_listings = n()
) %>%
arrange(med_dist)
kable(dist_neighbourhood) %>%
kable_styling()| neighbourhood_cleansed | med_dist | med_price | count_listings |
|---|---|---|---|
| Innere Stadt | 525.6608 | 107.0 | 417 |
| Wieden | 1874.1312 | 61.0 | 372 |
| Leopoldstadt | 2062.9222 | 60.0 | 962 |
| Landstra§e | 2374.3012 | 63.0 | 762 |
| Brigittenau | 2695.4915 | 55.0 | 356 |
| Alsergrund | 2697.6900 | 60.0 | 582 |
| Josefstadt | 2774.0108 | 58.0 | 290 |
| Margareten | 2952.5362 | 53.0 | 493 |
| Neubau | 3068.8588 | 64.0 | 545 |
| Mariahilf | 3150.7270 | 65.0 | 404 |
| Favoriten | 3740.6468 | 56.0 | 524 |
| W<U+008A>hring | 4064.0061 | 55.0 | 256 |
| D<U+009A>bling | 4669.0753 | 54.5 | 176 |
| Hernals | 4980.3359 | 50.0 | 279 |
| Ottakring | 5029.5999 | 49.0 | 407 |
| Rudolfsheim-F<U+009F>nfhaus | 5056.9721 | 55.0 | 656 |
| Meidling | 5347.0372 | 55.0 | 365 |
| Simmering | 6171.0211 | 55.0 | 112 |
| Floridsdorf | 6537.6258 | 55.0 | 156 |
| Penzing | 8115.2166 | 51.0 | 232 |
| Donaustadt | 9265.7734 | 63.0 | 327 |
| Hietzing | 10790.6303 | 59.0 | 110 |
| Liesing | 11179.5870 | 69.0 | 61 |
We now plot Neighbourhood median price against median distance from city center
dist_neighbourhood %>%
ggplot(aes(x=med_dist,y=med_price,size=count_listings))+
geom_point(color="#1A5276",alpha=0.4)+
geom_text_repel(aes(label=neighbourhood_cleansed),size=4)+
scale_size_continuous(range=c(4,20))+
theme_bw()+
theme(
plot.title =element_text(size=16, face='bold',hjust = 0,margin = margin(10,0,10,0)),
plot.subtitle =element_text(size=16, hjust = 0),
axis.text.x = element_text(size=10),
axis.text.y = element_text(size=12),
axis.ticks.x = element_line(),
axis.ticks.y=element_line(),
axis.title.x = element_text(size=12,face='bold'),
axis.title.y = element_text(size=12,face='bold'),
) +
labs(title = "Neighbourhood median price against median distance from city center",
x="Median distance from city center",y="Median Price")+
theme_economist_white()Innere Stadt is the neighbourhood located right at city center (587.45m in average); thus, it exhibits a high average price.
Neighbourhoods located less than 6 km to city center offer listings for price in range 55 to 65, with price decreasing when distance increases on average.
Interestingly, listings located more than 9km away have higher price for 60-70. There appears to be no certain pattern/relationship between distance from the center and price, but we will investigate further later.
# Grouping by neighbourhood_cleansed
neighbourhood_sorted = vienna_listings_final %>%
group_by(neighbourhood_cleansed) %>%
summarise(mean_price = mean(price)) %>%
arrange(-mean_price)
kable(neighbourhood_sorted) %>%
kable_styling()| neighbourhood_cleansed | mean_price |
|---|---|
| Innere Stadt | 106.51319 |
| Landstra§e | 70.88320 |
| Wieden | 69.38978 |
| Liesing | 69.14754 |
| Mariahilf | 68.78960 |
| Neubau | 67.91009 |
| Leopoldstadt | 67.06757 |
| Donaustadt | 65.33333 |
| Alsergrund | 65.32646 |
| Hietzing | 64.89091 |
| Brigittenau | 63.81461 |
| Josefstadt | 63.80000 |
| D<U+009A>bling | 63.50568 |
| W<U+008A>hring | 62.71094 |
| Margareten | 62.65314 |
| Favoriten | 62.40840 |
| Rudolfsheim-F<U+009F>nfhaus | 60.36280 |
| Simmering | 59.44643 |
| Floridsdorf | 58.80769 |
| Meidling | 58.16164 |
| Hernals | 57.46237 |
| Penzing | 56.77155 |
| Ottakring | 56.02211 |
We would like to divide our neighbourhoods into five different areas, as for the 23 we currently have, estimation error embedded in our final model would be too high out of sample, potentially giving our out-of-sample predictions massive standard errors. We decide to group the areas by similar mean prices. Since Innere Stadt is such an outlier in terms of price, we divide the rest of the areas into four groups (5,6,6,5).
Note that however, we get unicode characters in three of the neighbourhood_cleansed variables. However, we can avoid any hassle and simply call the variable from the table of neighbourhood_cleansed mean prices. We will later assess our groupings to see if they are statistically significant predictors of price.
# First we need to summarize the other values in the Category "Others"
vienna_listings_final <- vienna_listings_final %>%
mutate(neighbourhood_simplified = case_when(
neighbourhood_cleansed %in% c("Innere Stadt") ~ neighbourhood_cleansed,
neighbourhood_cleansed %in% c("Wieden", "Landstra§e", "Mariahilf", "Neubau", "Liesing") ~ "T2",
neighbourhood_cleansed %in% c("Leopoldstadt", "Alsergrund", "Hietzing", "Donaustadt", "Josefstadt") ~ "T3",
neighbourhood_cleansed %in% c(neighbourhood_sorted[12,1], "Brigittenau", "Favoriten", neighbourhood_sorted[15,1], "Margareten", "Floridsdorf") ~ "T4",
neighbourhood_cleansed %in% c(neighbourhood_sorted[18,1], "Meidling", "Simmering", "Penzing", "Hernals", "Ottakring") ~ "T5"
))
# Now we need to factorise the new variable
vienna_listings_final <- vienna_listings_final %>%
mutate(
neighbourhood_simplified = fct_relevel(neighbourhood_simplified,
"Innere Stadt",
"T2",
"T3",
"T4",
"T5")) %>%
select(-neighbourhood_cleansed)# Do boxplot for each neighbourhood simplified
vienna_listings_final %>%
filter(!is.na(neighbourhood_simplified)) %>%
ggplot(aes(y=price)) +
geom_boxplot(aes(x=fct_reorder(neighbourhood_simplified,price,median,.desc=TRUE)))+
labs(title = "Median Price by Neighbourhood")+
theme_economist()+
theme(
axis.title.x= element_blank()) # have to put this after we use economist theme or it puts the x axis title backWe can see see the gradual decline in median price from T2 to T5, but, again, Innere Stadt really stands out as the most in demand location for travelers visiting Vienna. As we saw earlier however, the median price for the locations doesn’t appear to be dependent on location in general. We cross-checked the logic of this with our resident Vienna group member, who confirmed that some of the nicest areas which are generlly in quite high demand from tourists are outside the city centre. Since Vienna’s attractions lie in its architecture and general beauty, we can assume that it is probably a more mature tourist audience, who may want to avoid the hustle and bustle of the city, seeing as nightlife isn’t the primary attraction of Vienna.
We can reasonably delete longitude and latitude from dataset now.
vienna_listings_final <- vienna_listings_final %>%
select(-c(longitude,latitude))For the target variable \(Y\), we will use the cost for two people to stay at an Airbnb location for four nights. We will create a new variable called
price_4_nightsthat usesprice, andaccomodatesto calculate the total cost for two people to stay at the Airbnb property for 4 nights. This is the variable \(Y\) we want to explain.
vienna_listings_reg <- vienna_listings_final %>%
filter(accommodates>=2) %>%
mutate(price_4_nights = price * 4,
lnprice_4_nights = log(price_4_nights))# Visualise the Price
vienna_listings_reg %>% ggplot(aes(price_4_nights)) +
geom_density()+
labs(title = "Price for Four Nights")+
theme_economist()# Visualise the Price
vienna_listings_reg %>% ggplot(aes(x=lnprice_4_nights)) +
geom_density()+
labs(title = "Log Price for Four Nights")+
theme_economist()We use log(price_4_nights) for the regression model for the reasons mentioned earlier. The positive skew in the price_4_nights variable makes it more difficult to predict extreme values of price, relative to the normalised natural logarithm.
There some missing values in review_scores_rating, which we are quite confident will be a significant predictor. We decide to remove these rows (the lm regression in R would remove these automatically but we prefer to do it now to have the data cleaner). We also decide that it makes sense to remove very new or “unrated” properties as data on reviewed properties will on average be much more reliable, and we still have a sufficient number of datapoints to formulate a model.
vienna_listings_reg <- vienna_listings_reg %>%
filter(!is.na(review_scores_rating)) %>%
filter(number_of_reviews>=10)# Set seed so we will get the same results
set.seed(202110)
# Split our data 75% as train and 25% as test
train_test_split <- initial_split(vienna_listings_reg, prop=0.75)
vienna_train <- training(train_test_split)
vienna_test <- testing(train_test_split)Model selection is a structured process. We employ a forward regression process, as we test the significance of variables whilst controlling for others in order to avoid including variables with high collinearity. We first assess the correlation matrix of our numerical explanatory variables in order to determine which variables have the strongest relationships with the target variable, and to highlight variables with high collinearity so as to avoid multicollinearity issues amongst the regressors. We then iterate through a number of models in order to determine if the increase in adjusted r squared is worth the increase in exposure to estimation error in a stepwise forward regression processz. When building regression models with a large number of potential independent variables, we must be cautious of the tradeoff between in-sample bias and out of sample variance of predictions. To optimise our model based on this tradeoff, we can compare the adjusted R squared against the out of sample RMSE (from the testing data). We also perform other model diagnostics such as analysis of the residuals, in order to ensure that they are approximately normal and homoskedastic with a mean of zero.
# Scatterplot/Correlation matrix for numerical variables
vienna_train %>%
select(where(is.numeric)) %>%
select(-price )%>%
ggpairs()+
theme_economist()This correlogram gives us a better indication of which variables may explain more of the target variable, which we will use to gear our forward regression in the right direction.
We can also see there are many variables which have a strong correlation with each other, eg. beds and accommodates. This means we need to consider multicollinearity in the regression amongst the predictors.
However, some scatterplots cannot support a linear relationship between variables, eg. price and total listings of the host. There could be some correlations conditional on the value of a categorical variable also which we need to be careful of, eg. the correlation between price and bedrooms could be conditional on property type or room type, since whether the listing is entire or a private room for rental might influence the number of bedrooms. For this reason we simply use the correlogram as a limited source of guidance to help us start off in the right direction
We fit regression model called model1 with the following explanatory variables: prop_type_simplified, number_of_reviews, and review_scores_rating.
model1 <- lm(lnprice_4_nights ~ prop_type_simplified+number_of_reviews+review_scores_rating, data = vienna_train)
par(mfrow=c(2,2))
msummary(model1) Estimate Std. Error t value
(Intercept) 4.377e+00 1.650e-01 26.520
prop_type_simplifiedPrivate room in rental unit -6.240e-01 2.110e-02 -29.574
prop_type_simplifiedEntire apartment 6.026e-02 2.871e-02 2.099
prop_type_simplifiedOther -1.726e-01 3.588e-02 -4.812
number_of_reviews -1.791e-04 9.057e-05 -1.977
review_scores_rating 2.559e-01 3.469e-02 7.375
Pr(>|t|)
(Intercept) < 2e-16 ***
prop_type_simplifiedPrivate room in rental unit < 2e-16 ***
prop_type_simplifiedEntire apartment 0.0359 *
prop_type_simplifiedOther 1.57e-06 ***
number_of_reviews 0.0481 *
review_scores_rating 2.09e-13 ***
Residual standard error: 0.3965 on 3105 degrees of freedom
Multiple R-squared: 0.238, Adjusted R-squared: 0.2368
F-statistic: 194 on 5 and 3105 DF, p-value: < 2.2e-16
kable(car::vif(model1)) %>%
kable_styling()| GVIF | Df | GVIF^(1/(2*Df)) | |
|---|---|---|---|
| prop_type_simplified | 1.006153 | 3 | 1.001023 |
| number_of_reviews | 1.004995 | 1 | 1.002494 |
| review_scores_rating | 1.003322 | 1 | 1.001659 |
autoplot(model1) +
theme_minimal() +
labs (title = "Model 1 Diagnostic Plots")Metrics:
The Coefficients:
We now want to determine if room_type is a significant predictor of the cost for 4 nights, given everything else in the model.
model2 <- lm(lnprice_4_nights ~ prop_type_simplified+room_type+number_of_reviews+review_scores_rating, data = vienna_train)
msummary(model2) Estimate Std. Error t value
(Intercept) 4.408e+00 1.637e-01 26.926
prop_type_simplifiedPrivate room in rental unit -8.108e-02 7.767e-02 -1.044
prop_type_simplifiedEntire apartment 6.050e-02 2.838e-02 2.132
prop_type_simplifiedOther 1.637e-01 5.778e-02 2.833
room_typeEntire home/apt -2.654e-02 1.502e-01 -0.177
room_typePrivate room -5.428e-01 7.482e-02 -7.255
room_typeShared room -9.182e-01 1.504e-01 -6.105
number_of_reviews -1.744e-04 8.973e-05 -1.943
review_scores_rating 2.493e-01 3.441e-02 7.246
Pr(>|t|)
(Intercept) < 2e-16 ***
prop_type_simplifiedPrivate room in rental unit 0.29660
prop_type_simplifiedEntire apartment 0.03312 *
prop_type_simplifiedOther 0.00464 **
room_typeEntire home/apt 0.85975
room_typePrivate room 5.04e-13 ***
room_typeShared room 1.16e-09 ***
number_of_reviews 0.05206 .
review_scores_rating 5.40e-13 ***
Residual standard error: 0.392 on 3102 degrees of freedom
Multiple R-squared: 0.2561, Adjusted R-squared: 0.2542
F-statistic: 133.5 on 8 and 3102 DF, p-value: < 2.2e-16
kable(car::vif(model2)) %>%
kable_styling()| GVIF | Df | GVIF^(1/(2*Df)) | |
|---|---|---|---|
| prop_type_simplified | 16.933444 | 3 | 1.602474 |
| room_type | 17.052748 | 3 | 1.604350 |
| number_of_reviews | 1.009458 | 1 | 1.004718 |
| review_scores_rating | 1.010395 | 1 | 1.005184 |
autoplot(model2) +
theme_minimal() +
labs (title = "Model 2 Diagnostic Plots")The adjusted R squared is 0.2542, which is a marginal improvement on the previous model at the cost of three extra sources of estimation error, which will increase the variance of our predicted means
We also get very high variance inflation factors for property type and room type, understandably. This is also indicated by the deviation of the previously significant property coefficient for private room in rental unit to being insignificant. Given that the room type variable only very marginally improved the model’s explanatory value, and introduces a lot of multicollinearity amongst the regressors, we decide to drop either the room_type or property_type variable. When we tested the above model with either room_type or prop_type_simplified, the room_type regression produced a higher standard error. We also prefer the room_type categories as they are more applicable to how a user would appraise a property
We want to test if the number of bathrooms, bedrooms, beds, or size of the house (accomodates) are significant predictors of price_4_nights. However, we suspect a high degree of multicollinearity betwen these variables, so we make a correlation matrix between these variables to determine if there is a high corelation between the predictors, and which one demonstrates the strongest relationship with the target variable.
vienna_train %>%
select(lnprice_4_nights, bathrooms, bedrooms, accommodates, beds) %>%
ggpairs()+
labs(title = "Correlation Matrix") Excluding bathrooms, there is a high degree of correlation between the predictors, and accommodates demonstrates the strongest relationship with the target variable. This makes sense, seeing as accommodates is determined by the number of beds and bedrooms in a property. Thus, we decide to only add in the Accommodates variable into our model.
Bathrooms demonstrates a relatively weak correlation coefficient with accommodates, but logically you would expect an accommodation with more bathrooms to cost more per person, when we control for the accommodates factor. Price is probably more sensitive to this variable when the number for accommodates is very high however.
model3 <- lm(lnprice_4_nights ~ room_type+number_of_reviews+review_scores_rating
+accommodates+bathrooms, data = vienna_train)
msummary(model3) Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.706e+00 1.524e-01 24.319 < 2e-16 ***
room_typeEntire home/apt 2.978e-01 1.275e-01 2.336 0.01954 *
room_typePrivate room -4.790e-01 1.911e-02 -25.060 < 2e-16 ***
room_typeShared room -9.443e-01 1.277e-01 -7.397 1.78e-13 ***
number_of_reviews -2.451e-04 8.218e-05 -2.982 0.00288 **
review_scores_rating 2.916e-01 3.164e-02 9.215 < 2e-16 ***
accommodates 9.319e-02 4.347e-03 21.438 < 2e-16 ***
bathrooms 1.397e-01 2.195e-02 6.366 2.23e-10 ***
Residual standard error: 0.3588 on 3092 degrees of freedom
(11 observations deleted due to missingness)
Multiple R-squared: 0.3757, Adjusted R-squared: 0.3743
F-statistic: 265.9 on 7 and 3092 DF, p-value: < 2.2e-16
kable(car::vif(model3)) %>%
kable_styling()| GVIF | Df | GVIF^(1/(2*Df)) | |
|---|---|---|---|
| room_type | 1.162031 | 3 | 1.025344 |
| number_of_reviews | 1.008566 | 1 | 1.004274 |
| review_scores_rating | 1.015483 | 1 | 1.007712 |
| accommodates | 1.226595 | 1 | 1.107517 |
| bathrooms | 1.110464 | 1 | 1.053786 |
autoplot(model3) +
theme_minimal() +
labs (title = "Model 3 Diagnostic Plots")Adjusted R-squared is 0.3743, greatly improved by factoring in accommodates and bathrooms, both of which are statistically significant (\(\alpha\) =5%).
Note the low multicollinearity amongst the regressors after the removal of the property type variable.
We now want to test if superhosts command a pricing premium, after controlling for other variables.
model4 <- lm(lnprice_4_nights ~ room_type+number_of_reviews+review_scores_rating
+accommodates+bathrooms+host_is_superhost, data = vienna_train)
msummary(model4) Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.040e+00 1.641e-01 24.621 < 2e-16 ***
room_typeEntire home/apt 3.232e-01 1.270e-01 2.546 0.010955 *
room_typePrivate room -4.733e-01 1.905e-02 -24.844 < 2e-16 ***
room_typeShared room -9.274e-01 1.271e-01 -7.296 3.76e-13 ***
number_of_reviews -3.200e-04 8.297e-05 -3.857 0.000117 ***
review_scores_rating 2.184e-01 3.436e-02 6.355 2.39e-10 ***
accommodates 9.256e-02 4.332e-03 21.369 < 2e-16 ***
bathrooms 1.305e-01 2.191e-02 5.956 2.87e-09 ***
host_is_superhostTRUE 7.652e-02 1.456e-02 5.256 1.58e-07 ***
Residual standard error: 0.3571 on 3089 degrees of freedom
(13 observations deleted due to missingness)
Multiple R-squared: 0.3816, Adjusted R-squared: 0.38
F-statistic: 238.2 on 8 and 3089 DF, p-value: < 2.2e-16
kable(car::vif(model4)) %>%
kable_styling()| GVIF | Df | GVIF^(1/(2*Df)) | |
|---|---|---|---|
| room_type | 1.167476 | 3 | 1.026143 |
| number_of_reviews | 1.037542 | 1 | 1.018598 |
| review_scores_rating | 1.208526 | 1 | 1.099330 |
| accommodates | 1.228944 | 1 | 1.108577 |
| bathrooms | 1.116155 | 1 | 1.056482 |
| host_is_superhost | 1.246227 | 1 | 1.116345 |
autoplot(model4) +
theme_minimal() +
labs (title = "Model 4 Diagnostic Plots")Adjusted R-squared is 0.38, a barely improved by including the Boolean host_is_superhost variable. However, this regressor is statistically significant, and positive as expected. We decide to continue with the variable as it is extremely significant, and wait until we control for some additional regressors.
Note that the lm() function automatically discards rows with missing values for the explanatory variables. We were reluctant to remove some of these rows earlier, as we did not know if some of these variables would be statistically significant predictors of the target variable,; and thus, we may have discarded valuable data unnecessarily if they were not useful.
We now want to test if the offering by some hosts allowing you to immediately book their listing is a significant predictor of the taget variable after controlling for other variables.
model5 <- lm(lnprice_4_nights ~ room_type+number_of_reviews+review_scores_rating+accommodates+bathrooms+host_is_superhost+instant_bookable, data = vienna_train)
msummary(model5) Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.089e+00 1.676e-01 24.394 < 2e-16 ***
room_typeEntire home/apt 3.211e-01 1.269e-01 2.530 0.011462 *
room_typePrivate room -4.759e-01 1.914e-02 -24.871 < 2e-16 ***
room_typeShared room -9.369e-01 1.273e-01 -7.361 2.32e-13 ***
number_of_reviews -3.203e-04 8.296e-05 -3.861 0.000115 ***
review_scores_rating 2.102e-01 3.483e-02 6.034 1.79e-09 ***
accommodates 9.300e-02 4.342e-03 21.420 < 2e-16 ***
bathrooms 1.299e-01 2.191e-02 5.930 3.37e-09 ***
host_is_superhostTRUE 7.790e-02 1.459e-02 5.340 9.99e-08 ***
instant_bookableTRUE -1.909e-02 1.336e-02 -1.429 0.153214
Residual standard error: 0.3571 on 3088 degrees of freedom
(13 observations deleted due to missingness)
Multiple R-squared: 0.382, Adjusted R-squared: 0.3802
F-statistic: 212.1 on 9 and 3088 DF, p-value: < 2.2e-16
kable(car::vif(model5)) %>%
kable_styling()| GVIF | Df | GVIF^(1/(2*Df)) | |
|---|---|---|---|
| room_type | 1.181293 | 3 | 1.028157 |
| number_of_reviews | 1.037548 | 1 | 1.018601 |
| review_scores_rating | 1.242236 | 1 | 1.114556 |
| accommodates | 1.234951 | 1 | 1.111283 |
| bathrooms | 1.116530 | 1 | 1.056660 |
| host_is_superhost | 1.251743 | 1 | 1.118813 |
| instant_bookable | 1.053646 | 1 | 1.026472 |
autoplot(model5) +
theme_minimal() +
labs (title = "Model 5 Diagnostic Plots")We now test the predictive ability of the neighbourhood_simplified variable which we created earlier.
model6 <- lm(lnprice_4_nights ~ room_type+neighbourhood_simplified+ number_of_reviews+review_scores_rating+accommodates+bathrooms+host_is_superhost, data = vienna_train)
msummary(model6) Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.120e+00 1.610e-01 25.598 < 2e-16 ***
room_typeEntire home/apt 7.512e-02 1.182e-01 0.636 0.52497
room_typePrivate room -4.612e-01 1.867e-02 -24.708 < 2e-16 ***
room_typeShared room -4.826e-01 1.654e-01 -2.918 0.00355 **
neighbourhood_simplifiedT5 -1.689e-01 1.916e-02 -8.818 < 2e-16 ***
neighbourhood_simplifiedT4 -1.500e-01 1.922e-02 -7.804 8.51e-15 ***
neighbourhood_simplifiedT2 4.415e-02 1.662e-02 2.656 0.00795 **
neighbourhood_simplifiedInnere Stadt 3.834e-01 2.755e-02 13.919 < 2e-16 ***
number_of_reviews -4.326e-04 8.167e-05 -5.297 1.27e-07 ***
review_scores_rating 2.081e-01 3.372e-02 6.173 7.72e-10 ***
accommodates 9.417e-02 4.330e-03 21.749 < 2e-16 ***
bathrooms 1.304e-01 2.137e-02 6.103 1.19e-09 ***
host_is_superhostTRUE 7.957e-02 1.436e-02 5.541 3.30e-08 ***
Residual standard error: 0.3298 on 2723 degrees of freedom
(375 observations deleted due to missingness)
Multiple R-squared: 0.4799, Adjusted R-squared: 0.4777
F-statistic: 209.4 on 12 and 2723 DF, p-value: < 2.2e-16
kable(car::vif(model6)) %>%
kable_styling()| GVIF | Df | GVIF^(1/(2*Df)) | |
|---|---|---|---|
| room_type | 1.183425 | 3 | 1.028466 |
| neighbourhood_simplified | 1.057168 | 4 | 1.006973 |
| number_of_reviews | 1.052265 | 1 | 1.025800 |
| review_scores_rating | 1.207981 | 1 | 1.099082 |
| accommodates | 1.237923 | 1 | 1.112620 |
| bathrooms | 1.127118 | 1 | 1.061658 |
| host_is_superhost | 1.251711 | 1 | 1.118799 |
autoplot(model6) +
theme_minimal() +
labs (title = "Model 6 Diagnostic Plots")+
theme(text = element_text(size=6))The introduction of the neighbourhood_simplified variable improves the adjusted R squared greatly to 0.48. Each bucket we computed is a statistically significant dummy variable at the 5% level of significance. However, we lose 362 additional observations. We decide that the increase in explanatory power is worth the loss in data as we still have a sufficient degrees of freedom to avoid over-fitting.
The inclusion of the neighbourhood_simplified variable however results in the “Entire home/apt” category of the room_type variable to become statistically insignificant i.e. the mean of the target variable is not statistically significantly different for an entire home/apartment than the baseline room_type in the model (Hotel room) when controlling for other variables. So we should group this category into the baseline category based on this. There must exist a relationship between the neighbourhoods and the room_type variable, which is understandable as more central areas will have more hotels and apartments on average when compared with areas outside/on the outskirts of the city.
# First we need to summarize the specified room_type category into the baseline for training and testing dataset, then we re-factorise them
vienna_train <- vienna_train %>%
mutate(room_types = case_when(
room_type %in% c("Hotel room", "Entire home/apt") ~ "Entire property",
room_type %in% c("Private room") ~ "Private room",
room_type %in% c("Shared room") ~ "Shared room"
))
# Now we need to factorise the new variable
vienna_train <- vienna_train %>%
mutate(
room_types = fct_relevel(room_types,
"Entire property",
"Private room",
"Shared room"))
vienna_test <- vienna_test %>%
mutate(room_types = case_when(
room_type %in% c("Hotel room", "Entire home/apt") ~ "Entire property",
room_type %in% c("Private room") ~ "Private room",
room_type %in% c("Shared room") ~ "Shared room"
))
# Now we need to factorise the new variable
vienna_test <- vienna_test %>%
mutate(
room_types = fct_relevel(room_types,
"Entire property",
"Private room",
"Shared room"))Now we fit model 6 again with the new room_type grouping.
model6 <- lm(lnprice_4_nights ~ room_types+neighbourhood_simplified+ number_of_reviews+review_scores_rating+accommodates+bathrooms+host_is_superhost, data = vienna_train)
msummary(model6) Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.123e+00 1.609e-01 25.627 < 2e-16 ***
room_typesPrivate room -4.616e-01 1.865e-02 -24.746 < 2e-16 ***
room_typesShared room -4.830e-01 1.654e-01 -2.920 0.00352 **
neighbourhood_simplifiedT5 -1.690e-01 1.916e-02 -8.824 < 2e-16 ***
neighbourhood_simplifiedT4 -1.498e-01 1.921e-02 -7.799 8.84e-15 ***
neighbourhood_simplifiedT2 4.414e-02 1.662e-02 2.656 0.00796 **
neighbourhood_simplifiedInnere Stadt 3.854e-01 2.737e-02 14.080 < 2e-16 ***
number_of_reviews -4.303e-04 8.158e-05 -5.274 1.44e-07 ***
review_scores_rating 2.077e-01 3.371e-02 6.161 8.30e-10 ***
accommodates 9.399e-02 4.320e-03 21.756 < 2e-16 ***
bathrooms 1.307e-01 2.137e-02 6.115 1.10e-09 ***
host_is_superhostTRUE 7.922e-02 1.435e-02 5.521 3.69e-08 ***
Residual standard error: 0.3297 on 2724 degrees of freedom
(375 observations deleted due to missingness)
Multiple R-squared: 0.4799, Adjusted R-squared: 0.4778
F-statistic: 228.5 on 11 and 2724 DF, p-value: < 2.2e-16
kable(car::vif(model6)) %>%
kable_styling()| GVIF | Df | GVIF^(1/(2*Df)) | |
|---|---|---|---|
| room_types | 1.156085 | 2 | 1.036925 |
| neighbourhood_simplified | 1.041761 | 4 | 1.005127 |
| number_of_reviews | 1.050158 | 1 | 1.024772 |
| review_scores_rating | 1.207406 | 1 | 1.098820 |
| accommodates | 1.232761 | 1 | 1.110298 |
| bathrooms | 1.126792 | 1 | 1.061505 |
| host_is_superhost | 1.249866 | 1 | 1.117974 |
autoplot(model6) +
theme_minimal() +
labs (title = "Model 6 Diagnostic Plots")+
theme(text = element_text(size=6))We have been using the number of reviews variable, but we are cautious that it may not be the most current reviews variable in the dataset, especially considering its miniscule coefficient. We now want to select which of the review number variables we should use. We look at the correlation matrix between number_of_reviews, reviews_per_month, number_of_reviews_ltm. We also want to test the effect of including the avalability_30 variable on the model’s explanatory power, as it could be a proxy for general demand.
vienna_train %>%
select(lnprice_4_nights, number_of_reviews, reviews_per_month, number_of_reviews_ltm) %>%
ggpairs()+
labs(title = "Correlation Matrix")We now fit the model with availability_30 included and number of reviews removed.
model7 <- lm(lnprice_4_nights ~ room_types+neighbourhood_simplified+review_scores_rating+accommodates+bathrooms+host_is_superhost+availability_30, data = vienna_train)
msummary(model7) Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.6447551 0.1567652 23.250 < 2e-16 ***
room_typesPrivate room -0.4398531 0.0179669 -24.481 < 2e-16 ***
room_typesShared room -0.5690356 0.1592463 -3.573 0.000359 ***
neighbourhood_simplifiedT5 -0.1733603 0.0184080 -9.418 < 2e-16 ***
neighbourhood_simplifiedT4 -0.1409184 0.0184791 -7.626 3.33e-14 ***
neighbourhood_simplifiedT2 0.0496294 0.0159820 3.105 0.001920 **
neighbourhood_simplifiedInnere Stadt 0.3836262 0.0263297 14.570 < 2e-16 ***
review_scores_rating 0.2869220 0.0327441 8.763 < 2e-16 ***
accommodates 0.0880354 0.0041754 21.084 < 2e-16 ***
bathrooms 0.1374545 0.0205616 6.685 2.79e-11 ***
host_is_superhostTRUE 0.0577977 0.0136433 4.236 2.35e-05 ***
availability_30 0.0100769 0.0006399 15.748 < 2e-16 ***
Residual standard error: 0.3173 on 2724 degrees of freedom
(375 observations deleted due to missingness)
Multiple R-squared: 0.5184, Adjusted R-squared: 0.5165
F-statistic: 266.6 on 11 and 2724 DF, p-value: < 2.2e-16
kable(car::vif(model7)) %>%
kable_styling()| GVIF | Df | GVIF^(1/(2*Df)) | |
|---|---|---|---|
| room_types | 1.159925 | 2 | 1.037785 |
| neighbourhood_simplified | 1.031108 | 4 | 1.003837 |
| review_scores_rating | 1.230617 | 1 | 1.109332 |
| accommodates | 1.243520 | 1 | 1.115132 |
| bathrooms | 1.127023 | 1 | 1.061613 |
| host_is_superhost | 1.220213 | 1 | 1.104633 |
| availability_30 | 1.050895 | 1 | 1.025131 |
autoplot(model7) +
theme_minimal() +
labs(title = "Model 7 Diagnostic Plots")+
theme(text = element_text(size=6))Availability_30 is statistically significant at the 5% level of significance. It also increases our model’s adjusted R squared by over 0.04 to 0.52, which is a welcome boost. Multicollinearity is relatively low in this model as well which gives us confidence in the approximate independence of our explanatory variables. If availability_30 increases by 1, we would expect, based on the model, that price for two people for four nights would increase by c. 1% on average. This variable is statistically significant at the 5% level of significance.
After running the specified regressions, we must look at the remaining variables we have not used so far to potentially optimise our models explanatory value further.
colnames(vienna_train %>%
select(-c(lnprice_4_nights,prop_type_simplified,neighbourhood_simplified,review_scores_rating,accommodates,host_is_superhost,availability_30,number_of_reviews, reviews_per_month, number_of_reviews_ltm,instant_bookable,bathrooms, bedrooms, beds,room_type, minimum_nights, maximum_nights, lnprice, dist_from_cent, price_4_nights, availability_365,price,room_types)))[1] "host_response_time" "host_response_rate"
[3] "host_acceptance_rate" "host_total_listings_count"
[5] "host_identity_verified"
Relevant variables not tested yet:
We decide to test the host_identity_verified Boolean variable, as we can logically hypothesise that people could potentially pay a premium for the security of the booking. However, the host may not demand a premium for being verified, since it is strictly the seller which determines market price here.
model8 <- lm(lnprice_4_nights ~ room_types+neighbourhood_simplified+review_scores_rating+accommodates+bathrooms+host_is_superhost+availability_30+host_identity_verified, data = vienna_train)
msummary(model8) Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.6343325 0.1573285 23.100 < 2e-16 ***
room_typesPrivate room -0.4385001 0.0180494 -24.294 < 2e-16 ***
room_typesShared room -0.5647267 0.1593503 -3.544 0.000401 ***
neighbourhood_simplifiedT5 -0.1724669 0.0184439 -9.351 < 2e-16 ***
neighbourhood_simplifiedT4 -0.1397172 0.0185426 -7.535 6.61e-14 ***
neighbourhood_simplifiedT2 0.0502802 0.0160042 3.142 0.001698 **
neighbourhood_simplifiedInnere Stadt 0.3836471 0.0263315 14.570 < 2e-16 ***
review_scores_rating 0.2871503 0.0327477 8.769 < 2e-16 ***
accommodates 0.0878500 0.0041823 21.005 < 2e-16 ***
bathrooms 0.1365153 0.0205973 6.628 4.09e-11 ***
host_is_superhostTRUE 0.0566953 0.0137152 4.134 3.68e-05 ***
availability_30 0.0100626 0.0006402 15.718 < 2e-16 ***
host_identity_verifiedTRUE 0.0131092 0.0165703 0.791 0.428940
Residual standard error: 0.3173 on 2723 degrees of freedom
(375 observations deleted due to missingness)
Multiple R-squared: 0.5185, Adjusted R-squared: 0.5164
F-statistic: 244.4 on 12 and 2723 DF, p-value: < 2.2e-16
kable(car::vif(model8)) %>%
kable_styling()| GVIF | Df | GVIF^(1/(2*Df)) | |
|---|---|---|---|
| room_types | 1.171649 | 2 | 1.040398 |
| neighbourhood_simplified | 1.040134 | 4 | 1.004931 |
| review_scores_rating | 1.230712 | 1 | 1.109375 |
| accommodates | 1.247435 | 1 | 1.116886 |
| bathrooms | 1.130779 | 1 | 1.063381 |
| host_is_superhost | 1.232940 | 1 | 1.110378 |
| availability_30 | 1.051727 | 1 | 1.025537 |
| host_identity_verified | 1.050888 | 1 | 1.025128 |
autoplot(model8) +
theme_minimal() +
labs (title = "Model 8 Diagnostic Plots")+
theme(text = element_text(size=6))The host identity verification variable is not statistically significant at the 5% level of significance, so we discard it.
We have now exhausted our supply of explanatory variables. We can say so far that Model 7 is the best in terms of adjusted R squared so far, but we will compare the models further below using AIC, residual standard error, and out-of-sample RMSE after. We will then perform diagnostics tests on the residuals, as if the distribution of the residuals violates our OLS assumptions, we must re-select the models, or perform heteroskedasticity and autocorrelation robust regressions.
models_compare <- huxreg(model1, model2, model3, model4, model5, model6, model7,model8,
statistics = c('#observations' = 'nobs',
'R squared' = 'r.squared',
'Adj. R Squared' = 'adj.r.squared',
'Residual SE' = 'sigma',
'AIC' = 'AIC'),
bold_signif = 0.05,
stars = NULL
) %>%
set_caption('Comparison of models')
models_compare| (1) | (2) | (3) | (4) | (5) | (6) | (7) | (8) | |
|---|---|---|---|---|---|---|---|---|
| (Intercept) | 4.377 | 4.408 | 3.706 | 4.040 | 4.089 | 4.123 | 3.645 | 3.634 |
| (0.165) | (0.164) | (0.152) | (0.164) | (0.168) | (0.161) | (0.157) | (0.157) | |
| prop_type_simplifiedPrivate room in rental unit | -0.624 | -0.081 | ||||||
| (0.021) | (0.078) | |||||||
| prop_type_simplifiedEntire apartment | 0.060 | 0.061 | ||||||
| (0.029) | (0.028) | |||||||
| prop_type_simplifiedOther | -0.173 | 0.164 | ||||||
| (0.036) | (0.058) | |||||||
| number_of_reviews | -0.000 | -0.000 | -0.000 | -0.000 | -0.000 | -0.000 | ||
| (0.000) | (0.000) | (0.000) | (0.000) | (0.000) | (0.000) | |||
| review_scores_rating | 0.256 | 0.249 | 0.292 | 0.218 | 0.210 | 0.208 | 0.287 | 0.287 |
| (0.035) | (0.034) | (0.032) | (0.034) | (0.035) | (0.034) | (0.033) | (0.033) | |
| room_typeEntire home/apt | -0.027 | 0.298 | 0.323 | 0.321 | ||||
| (0.150) | (0.127) | (0.127) | (0.127) | |||||
| room_typePrivate room | -0.543 | -0.479 | -0.473 | -0.476 | ||||
| (0.075) | (0.019) | (0.019) | (0.019) | |||||
| room_typeShared room | -0.918 | -0.944 | -0.927 | -0.937 | ||||
| (0.150) | (0.128) | (0.127) | (0.127) | |||||
| accommodates | 0.093 | 0.093 | 0.093 | 0.094 | 0.088 | 0.088 | ||
| (0.004) | (0.004) | (0.004) | (0.004) | (0.004) | (0.004) | |||
| bathrooms | 0.140 | 0.130 | 0.130 | 0.131 | 0.137 | 0.137 | ||
| (0.022) | (0.022) | (0.022) | (0.021) | (0.021) | (0.021) | |||
| host_is_superhostTRUE | 0.077 | 0.078 | 0.079 | 0.058 | 0.057 | |||
| (0.015) | (0.015) | (0.014) | (0.014) | (0.014) | ||||
| instant_bookableTRUE | -0.019 | |||||||
| (0.013) | ||||||||
| room_typesPrivate room | -0.462 | -0.440 | -0.439 | |||||
| (0.019) | (0.018) | (0.018) | ||||||
| room_typesShared room | -0.483 | -0.569 | -0.565 | |||||
| (0.165) | (0.159) | (0.159) | ||||||
| neighbourhood_simplifiedT5 | -0.169 | -0.173 | -0.172 | |||||
| (0.019) | (0.018) | (0.018) | ||||||
| neighbourhood_simplifiedT4 | -0.150 | -0.141 | -0.140 | |||||
| (0.019) | (0.018) | (0.019) | ||||||
| neighbourhood_simplifiedT2 | 0.044 | 0.050 | 0.050 | |||||
| (0.017) | (0.016) | (0.016) | ||||||
| neighbourhood_simplifiedInnere Stadt | 0.385 | 0.384 | 0.384 | |||||
| (0.027) | (0.026) | (0.026) | ||||||
| availability_30 | 0.010 | 0.010 | ||||||
| (0.001) | (0.001) | |||||||
| host_identity_verifiedTRUE | 0.013 | |||||||
| (0.017) | ||||||||
| #observations | 3111 | 3111 | 3100 | 3098 | 3098 | 2736 | 2736 | 2736 |
| R squared | 0.238 | 0.256 | 0.376 | 0.382 | 0.382 | 0.480 | 0.518 | 0.519 |
| Adj. R Squared | 0.237 | 0.254 | 0.374 | 0.380 | 0.380 | 0.478 | 0.516 | 0.516 |
| Residual SE | 0.397 | 0.392 | 0.359 | 0.357 | 0.357 | 0.330 | 0.317 | 0.317 |
| AIC | 3081.534 | 3012.724 | 2452.743 | 2423.248 | 2423.201 | 1707.104 | 1496.520 | 1497.892 |
# Get list of adjusted R squared values
ar2_comparison = data.frame(summary(model1)$adj.r.squared,summary(model2)$adj.r.squared,summary(model3)$adj.r.squared,summary(model4)$adj.r.squared,summary(model5)$adj.r.squared,summary(model6)$adj.r.squared,summary(model7)$adj.r.squared,summary(model8)$adj.r.squared)# Get out of sample RMSE
RMSE1 = 100*sd((predict(model1, vienna_test)-vienna_test$lnprice_4_nights),na.rm=TRUE)
RMSE2 = 100*sd((predict(model2, vienna_test)-vienna_test$lnprice_4_nights),na.rm=TRUE)
RMSE3 = 100*sd((predict(model3, vienna_test)-vienna_test$lnprice_4_nights),na.rm=TRUE)
RMSE4 = 100*sd((predict(model4, vienna_test)-vienna_test$lnprice_4_nights),na.rm=TRUE)
RMSE5 = 100*sd((predict(model5, vienna_test)-vienna_test$lnprice_4_nights),na.rm=TRUE)
RMSE6 = 100*sd((predict(model6, vienna_test)-vienna_test$lnprice_4_nights),na.rm=TRUE)
RMSE7 = 100*sd((predict(model7, vienna_test)-vienna_test$lnprice_4_nights),na.rm=TRUE)
RMSE8 = 100*sd((predict(model8, vienna_test)-vienna_test$lnprice_4_nights),na.rm=TRUE)
RMSE_compare = data.frame(RMSE1, RMSE2, RMSE3, RMSE4, RMSE5, RMSE6, RMSE7, RMSE8)
kable(RMSE_compare)%>%
kable_styling()| RMSE1 | RMSE2 | RMSE3 | RMSE4 | RMSE5 | RMSE6 | RMSE7 | RMSE8 |
|---|---|---|---|---|---|---|---|
| 39.79635 | 39.06386 | 36.74797 | 36.74612 | 36.70792 | 33.97237 | 32.45942 | 32.45975 |
Adjusted R-Squared
Residual Standard Error
AIC
Out-of-sample RMSE
Residuals
Model 7’s residuals appear to be randomly scattered around its mean of 0, and look approximately normal as they appear to simulate a white noise distribution with no clear pattern as the target variable varies. The QQ plot is approximately along the 45 degree line also, with most residual quantiles well-fitted (note some deviation for extreme values however). It does not appear that we require heteroskedasticity robust standard errors.
Therefore, we decide to select model 7 as our optimal model for the price for two people to stay for four nights in Vienna. On average, this model should give us the most accurate predictions for average property rental price for the given input variables with the lowest standard errors amongst the fitted models.
Finally, we use our optimised model for predictions. The model reads:
\(Y = 3.64 - 0.44(Private\ Room) - 0.57(Shared\ Room) - 0.17(T5) - 0.13(T4) + 0.05(T2) + 0.38(Innere\ Stadt) + 0.29(Rating) + 0.09(Accommodates) + 0.14(Bathrooms) + 0.06(Superhost) + 0.01(Availability\_30) + \epsilon_i\)
Interpretations of Coefficients (when controling for other variables)
Private Room: the negative coefficient shows that you should expect to pay c. 44% less for a private room in a shared accommodation as opposed to a hotel room for four nights (which is the baseline)
Shared Room: you should expect to pay c. 57% less for a private room in a shared room as opposed to a hotel room (which is the baseline) for four nights
Logic: travelers pay a premium for privacy
T5: you should expect to pay c. 17% less to stay in a T5 neighbourhood (bottom 5 nighbourhoods by mean price) for four nights than a T3 neighbourhood (baseline)
T4: you should expect to pay c. 13% less to stay in a T4 neighbourhood for four nights than a T3 neighbourhood (baseline)
T2: you should expect to pay c. 5% more to stay in a T2 neighbourhood for four nights than a T3 neighbourhood (baseline)
Innere Stadt: you should expect to pay c. 38% more to stay in the Innere Stadt neighbourhood for four nights than a T3 neighbourhood (baseline)
Logic: location is one of the primary considerations for travelers staying in Vienna
Rating: you should expect to pay c. 29% more on average to stay in a property with a rating of one point higher than another for four nights, all else equal
Accommodates: you should expect to pay c. 9% more on average to stay in a property which can accommodate one more person than another for four nights, all else equal
Bathrooms: you should expect to pay c. 14% more on average to stay in a property with one more bathroom than another for four nights, all else equal
Superhost: you should expect to pay c. 6% more to stay in a property listed by a superhost on average, all else equal
Availability_30: you should expect to pay c. 1% more on average to stay in a property which is available for one more day than another property in the next thirty days, all else equal
Logic: overpriced accommodation is likely to be a deterrent of demand
We first predict the price for a private room, with at least 10 reviews, and an average rating of at least 90 (4.5/5.0). We also assume that the property accommodates 2 people, has one bathroom, and has the average availability_30 in the entire cleaned dataset. We predict for each of the five neighbourhood buckets from most to least expensive We use the model to predict the total cost to stay at this Airbnb for 4 nights.
imaginary_stay1 <- data_frame(room_types="Private room",
neighbourhood_simplified = c("Innere Stadt","T2", "T3", "T4", "T5"), # vary at will
review_scores_rating = 4.5:5,
accommodates=2,
bathrooms=1,
host_is_superhost=TRUE,
availability_30=mean(vienna_listings_reg$availability_30))
pred1=data.frame(exp(predict.lm(model7, imaginary_stay1, interval = "confidence")), row.names = c("Innere Stadt","T2", "T3", "T4", "T5"))
kable(pred1)%>%
kable_styling()| fit | lwr | upr | |
|---|---|---|---|
| Innere Stadt | 208.2214 | 195.3977 | 221.8867 |
| T2 | 149.0982 | 142.3217 | 156.1974 |
| T3 | 141.8792 | 135.4586 | 148.6040 |
| T4 | 123.2306 | 117.2191 | 129.5503 |
| T5 | 119.2969 | 113.4989 | 125.3911 |
The table above gives the 95% confidence interval and expectation of the price for two people to stay for four nights with the other criteria mentioned for each of the simplified neighbourhood buckets. This demonstrates the significant difference between the neighbourhoods, with Innere Stadt the obvious positive outlier in terms of price. This table gives the user the location/price trade-off of staying in a nicer or potentially better located area, with the given assumptions above.
#plotting graph using geom_errorbar
difference_pred1<- pred1 %>%
ggplot(aes(color=row.names(pred1))) +
geom_errorbar(aes(y=row.names(pred1),xmin=lwr,xmax= upr),width=0.1,size=1.5,show.legend = FALSE) +
geom_point(aes(x=fit,y=row.names(pred1)),size=5,show.legend = FALSE)+
geom_text(aes(label=round(lwr,digits=2),x=round(lwr,digits=2),y=row.names(pred1)),size=3,color="black",hjust=0.5,vjust=-0.75,nudge_x=0.01,nudge_y=0.08,fontface = "bold")+
geom_text(aes(label=round(upr,digits=2),x=round(upr,digits=2),y=row.names(pred1)),size=3,color="black",hjust=0.5,vjust=-0.75,nudge_x=0.01,nudge_y=0.08,fontface = "bold")+
geom_text(aes(label=round(fit,digits=2),x=round(fit,digits=2),y=row.names(pred1)),size=3,color="black",hjust=0.4,vjust=-0.75,nudge_x=0.01,nudge_y=0.08,fontface = "bold")+
theme(legend.position="none",legend.title = element_blank())+
labs(
title = "Prediction by Grouped Neighbourhood - Vienna",
subtitle = "95% Confidence Intervals",
x = "Prediction Price for Two People for Four Nights",
y = "Neighbourhoods"
)+
theme_economist()+
NULL
difference_pred1Note that the 95% confidence intervals for the predictions for T5 and T4 overlap, as do T3 and T2. However, they are statistically significant predictors so we are not alarmed.
Though we are quite happy with our results and the explanatory power of the final model, we must note potential improvements to the process/model:
The data obtained was flawed (though not fatally flawed!) in numerous ways. The biggest issue with the data was the volume of missing values for potential explanatory variables, which can increase the standard errors of the parameter estimates. Ideally, we would like to maximise the amount of datapoints when fitting our model, but unfortunately this was negated by the inclusion of certain independent variables. The amount of transformations required and outliers in the data also can increase our margin of error in fitting and interpreting the data.
The subjectivity involved in truncating the tails of the data can take away from the scientific method employed. Other sources of subjectivity/bias include the grouping of categories when simplifying variables and omitting properties with fewer than ten reviews. This non-exact, though logic-based, decision-making can lead to bias in the model.
The inclusion of twelve coefficient estimates (including the intercept) leaves us prone to over-fitting errors which can increase out-of-sample variance in the residuals of the model. It also leaves us prone to estimation error which can void a model’s usefulness in real life. However, we exercised caution consistently throughout the model selection process when analysing the tradeoff between bias and variance when including additional regressors.